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Abstract 

This article describes a robust algorithm to estimate a conditional probability density 
f{t\x) as a non-parametric smooth regression function. It is based on a neural network 
and the Bayesian interpretation of the network output as a posteriori probabability. The 
network is trained using example events from history or simulation, which define the 
underlying probability density f{t,x). 

Once trained, the network is applied on new, unknown examples x, for which it can 
predict the probability distribution of the target variable t. Event- by-event knowledge 
of the smooth function f{t\x) can be very useful, e.g. in maximum likelihood fits 
or for forecasting tasks. No assumptions are necessary about the distribution, and 
non-Gaussian tails are accounted for automatically. Important quantities like median, 
mean value, left and right standard deviations, moments and expectation values of any 
function of t are readily derived from it. 

The algorithm can be considered as an event-by-event unfolding and leads to sta- 
tistically optimal reconstruction. The largest benefit of the method lies in complicated 
problems, when the measurements x are only relatively weakly correlated to the output 
t. As to assure optimal generalisation features and to avoid overfitting, the networks are 
regularised by extended versions of weight decay. The regularisation parameters are de- 
termined during the online-learning of the network by relations obtained from Bayesian 
statistics. 

Some toy Monte Carlo tests and first real application examples from high-energy 
physics and econometry are discussed. 



This article is a reprint of an internal EKP note from January 2001, corrected and 
pplemented by information on the development in the 3 years since then. The algorithm is 
implemented and further developed in the NeuroBayes®package by Phi-T®Physics 
Information Technologies GmbH, Karlsruhe, Germany. 



1 Introduction 



Assume we have a random variable t whose probability density function is known or can be 
estimated from a large but finite number of examples, e.g. the energy distribution of a certain 
very short-lived particle in a high energy physics experiment or the daily change of an equity 
price. The aim is to model the probability density for a given event or date from available 
input data. In our examples one might have one or more, perhaps correlated, not very precise 
measurements of the energy. Or one knows the actual price and the recent history plus some 
technical indicators at a given day and wants to predict what will be the probability density 
for next day's or week's price change. It is easy to know what happens on average, given by 
/(t), but one wants to have a better estimate taking into account the particular situation of 
the event under consideration, x. 

We do not just want to have a single number for t, but an estimate of the complete 
probability density, from which we then can deduce the most probable value, the mean value, 
the median, moments, but also uncertainty intervals or expectation values of any function 
of t. As an example, from the probability density describing an equity price change one can 
compute the probability density of an option price for that equity and its fair price. 

Precise measurements are easy to handle, one just identifies the measured value with the 
true value and uses classical statistical methods for error estimates etc. The method proposed 
in this article has its largest benefit for difficult predictions, when the measurements are not 
very strongly correlated to the desired output, i.e. when the individual measurements are not 
assumed to be much more precise than the width of the complete distribution averaged over 
many events. This certainly is the case in the last example of predicting future equity prices. 

The algorithm is a Bayesian estimator in the sense that it takes into account a priori 
knowledge in the form of the inclusive (unconditional) distribution. It will never result in un- 
physical values outside the training range, a very helpful feature when the input measurements 
are not very exact. The influence of the shape of the inclusive distribution diminishes with 
increasing precision of the input measurements [Ij. Bayes theorem plays a large role in two 
other places: the interpretation of network output levels as Bayesian a posteriori probabilities 
in classificiation problems, and for determining regularisation constants. 

In recent years the analysis methods in High Energy Physics experiments have been steadily 
further developed [2 3J. Neural network algorithms for classification tasks [4J have become 
very successful and - after initial scepticism in the community - essentially accepted. In the 
DELPHI experiment at LEP the author and coworkers have optimised electron identification 
IS], kaon and proton identification ^ using neural networks, and developed inclusive b-hadron 
reconstruction algorithms [7J for energy, decay length, lifetime reconstruction, B hadron iden- 
tification, particle/antiparticle distinction at production and decay time etc., which have been 
and are applied in many analyses. These are hybrid algorithms combining classical approaches 
with a large number of neural networks at different levels (single track, hemisphere, event). 
Also other modern statistical methods have been investigated, e.g. Support Vector Machines 
[8J. However, we have found [9J that for difficult non-separable classification problems often 
encountered in High Energy Physics simple neural networks with intelligent preprocessing are 
more useful. With this knowledge and experience we have searched for a method to extend 
this technology to continuous, real-valued variables t. 

The proposed algorithm is based on a simple feed-forward neural network which is trained 
using backpropagation [lOj of either simulated Monte Carlo or historical data. The cumulated 
probability distribution of the whole dataset is discretised into N bins, with variable width but 



1 



same amount of statistics inside. For each of the bins, a separate output node is trained to 
the binary classification problem "the true value is above the threshold value" vs. "the true 
value is below the threshold value". The output of the network is filtered through a symmetric 
sigmoid transfer function, as usual in binary decision nets. A cubic B-spline (see e.g. (2]) is 
fit through the filtered net output values, forced through —1 and 1 at the extreme values 
and applying Tikhonov-type regularisation [llj on the basis of the third derivatives' squares. 

This spline, properly rescaled, is a robust estimator of the cumulative probability distribution 
function of the true value for a given event. Median and quantiles are easily obtained from it, 
the derivative of the spline is an estimator of the probability density itself. This can e.g. be 
employed in maximum likelihood fits. Reconstructing the truth distribution for each individual 
event, the algorithm corresponds to an event-by-event unfolding. In fact, it can be extended 
to an unfolding procedure to estimate an unknown f{t), but this is left for future research. 

Some methods to predict conditional probability distributions from examples can already be 
found in the literature, but have been come to the author's attention only after this work was 
essentially completed. Weigend's and Srivastava's ansatz is similar to the one presented in 
this paper, also based on a feed-forward network with several output nodes, however it works 
directly on the probability density function. It does not have the aim of a smooth density 
function, does not include regularisation terms and does not include pre- and postprocessing. 
Other methods are based on kernel estimation 15]. 

Section 2 contains a detailed description of the algorithm, in section 3 the theoretical back- 
ground is laid out. In section 4 we review and develop a number of optimisations implemented 
during the development of the example nets. Section 5 and 6 describe toy and real world 
example applications from high energy physics and econometry. 

2 The algorithm 

Assume we have a random variable t which is distributed according to a probability density 
f{t). t may e.g. be the day-to-day change of an equity price or a quantum physical quantity. 
Our knowledge of this density is empirical, i.e. it is determined by a large number of examples, 
e.g. historical data or data simulated by Monte Carlo methods. For each of these examples, 
or "events", labelled i, there exists a vector of measurements Xi that are correlated to the 
variable U. An overall probability density f{t,x) is defined by the training sample. 

It is the aim of the method described in this paper to achieve a smooth estimate of 
the conditional probability density f{t\xi) for a given measurement vector Xi. If there is no 
information in the input vectors then obviously f(ti\xi) = f(ti) = f{t), i.e. the conditional 
probability is equal to the inclusive distribution. This would e.g. be true for a random walk 
model in econometry. If however there is a correlation then one should be able to get a better 
estimator for a given event, f{ti\xi), and from that one can deduce better expectation values 
and error intervals. 

The method proceeds in several steps: 

2.1 Preprocessing of the target values 

At first we perform a flattening of the distribution f{t) by an in general non-linear, monotonous 
variable transformation F : t s, such that the transformed variable s is distributed uniformly 
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between and 1: sit-in) = and s(tmax) = 1: 

s = Fit) = f f{t')dt'. (1) 

s measures the fraction of events that has lower values of t than the actual t. The probability 
density in the transformed variable g{s) is simply constant 1 in the interval [0,1]. The value 
s therefore can be interpreted as the cumulative probability density of its own distribution: 

s = Gis)= f g{s')ds'. (2) 

This is illustrated in Fig.lU 




Figure 1: Variable transformation F : t ^ s (upper right) leading to a flat output 
distribution g{s) (upper left) when t is distributed according to f{t) (lower plot). 
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A robust method of constructing F from a finite number of examples is to sort them 
(using a binary search tree) according to t. then just is i/N^vents, when tj is the i*^ 

element in the sorted list. It is practical to store the t-values in 1% bins, i.e. the 101 values 
tmin,'ti%,t2%, ■■■,'tmax- The functions F and F^^ for transformation and back transformation 
are constructed using this list and linear interpolation. However, for some problems there may 
be long but important tails (e.g. describing stock market crashes) such that by just storing 
1% quantiles information may be lost. To avoid that, a histogram with 200 equally sized bins 
between the observed minimum and maximum value is stored in addition. Later both lists are 
used for reconstruction. 

2.2 Discretisation 

The probability distribution G{s) is now sampled at equidistant levels Lj = (j — 0.5)/A^, 
j = 1,N. From the property G{s) = s, it further follows that G{0) = and G{1) = 1, such 
that N + 1 intervals are defined. All the inner intervals contain the same number of events, 
namely Nevents/N, the first and last interval half of this number. For = 10 this is illustrated 
in Fig. 1. 

2.3 Neural Net Prediction 

Classification of events into two classes is a task that neural networks can handle well. A 
possible classification task for a neural network is to separate events with t larger than a 
threshold value L from those with t < L. The main idea is that the conditional cumulated 
probability density G{s\x) can be estimated from many neural networks that perform such 
classifications at different thresholds Lj. Instead of training several independent networks, 
this is done in one net that has N output nodes, corresponding to the discretised output 
levels. In such a (usually sufficient) three layer neural net (see Fig. 2) the outputs oj are 
calculated via 

o, = s{Y.wff'-s{Y^wir'-x,)) (3) 

I k 

where the are the input values and the index / runs over the intermediate layer nodes. 
wll^'^ is the weight of the connection between node k of the first and node / of the second 
layer, and wfp^ that between node / of the second and node j of the output layer. S{x) 
is a transfer function that maps ] — oo,oo[ to the interval [—1,1], for which we apply the 
symmetric sigmoid function 

Si^) = - 1- (4) 

1 + e 

The architecture of the network is shown in FigEl 

As in common simple classification neural net training, the weights are determined using 
the iterative back propagation algorithm employing historical or simulated data, with some 
acceleration and regularisation techniques applied as outlined in section 4. 

Each output node j is trained on the binary decision: is the real output Sfme larger than 
Lj (target value = 1) or it is smaller (target = —1)? Thus the target vector of an event with 
strue = 0.63 and A^ = 10 levels would be f = (+1, +1, +1, +1, +1, +1, -1, -1, -1, -1, ). 
Note that stme = 0.56 and Stme = 0.64 would give the same target vector, this is the 
discretisation uncertainty introduced. If the individual measurement resolution is good, this 
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is a potential source of information loss. The number of intervals should be matched 
to the obtainable resolution. An alternative to reduce the discretisation information loss 
in the training is to set the target node nearest to the true value Stme to a value be- 
tween —1 and 1 as to smoothen the dependence. In this training mode the target vec- 
tor would be f = (+l,-M,-hl,-hl,-M,-F0.2,-l,-l,-l,-l,) for stme = 0.56, but 
T = (-1-1, -1-1, -1-1, -1-1, -1-1, -1-1, —0.2, —1, —1, —1, ) for Strue = 0.64. This procedure can be 
used when employing the quadratic loss function. 




Figure 2: Example architecture of the Bayesian neural net, with 8 input nodes, 10 inter- 
mediate, and = 10 output nodes. The lines denote the connections, each of which is 
associated with a weight that is optimised in the training. The post-processing consists of 
a regularised spline fit through the single nodes' output, from which the interesting final 
quantities can be calculated. 



2.4 Neural Net Output Interpretation 
After minimising the quadratic loss function 

j j « 

(where Tji denotes the target value for output node j in event i and the definiton and role of 
the Wj is as described in section 3) in the network training, and assuming that the network is 
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Neural Net Output (node 10 of 20) 
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Figure 3: Left: Signal (true s is larger than Liq = 0.475) and background distribution 
of the 10*'' node of a iV = 20 network. Right: Signal purity in each bin as function of 
network output o. The expected linear behaviour is observed, demonstrating that the net 
is well trained. 

well trained, the network output oj, rescaled to the interval [0, 1], is equal to the purity 

P{Oj) = fiStrue < L^\0,)/f{0,) = {o, + l)/2. (6) 

This can be seen as follows: The mean contribution of measured events with output o is 



X 



P-(l-of + (l-P).((-l)-o)^ 



(7) 



where the first term describes the contribution of signal events (purity P, target +1), and 
the second of background events (purity 1 — P, target —1). When the network is trained 
optimally, is minimal, i.e. dx^/do = 0. This directly leads to P = (o + l)/2. Figure El 
shows signal and background distribution and the purity in each bin as a function of the net 
output for the 10*^ node of a = 20 network. Thus (oj + l)/2 is the probability that the 
true value really lies below the given threshold Lj. An interpolation through the rescaled 
network outputs is the desired conditional cumulated probability density G{s\x). 
The same arguments also hold for the entropy loss function: 



E 



D 



^,5:iog(-(i+T, 



+ e)) 



which we prefer since it has some advantages in classification problems. In this case a com- 
pletely wrong classification Oj = 1 for tj = —1 or vice versa leads to an infinitely large Ed. To 
get rid of such completely wrong classifications is thus the first thing learned by the network 
using the entropy error function. In order to avoid numerical problems for untrained networks. 
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a small regularistaion constant e is introduced, e is reduced in each training iteration and is 
zero after just a few iterations. 

Moreover, the absolute value of Ed has a meaning in Bayesian statistics avoiding an extra 
regularisation constant in the weight decay regularisation scheme (see section IT4|) . 

2.5 Interpolating the discrete network outputs 

In order to get a closed functional form and to be able to calculate G{s,x) and its derivative, 
the probability density g{s,x), at any value s, a cubic B-spline fit with variable bin size is 
applied through the end points (0,1), (1,-1) and the points (Lj,Oj) estimated by the 
neural network. It is constructed from a 4-fold knot at s = 0, a 4-fold knot at s = 1, and 
a number of simple knots placed in between and 1. The actual number should be chosen 
smaller than the number of output nodes N, as to smoothen statistical network fluctuations. 
It has been proven useful to not distribute the nodes equidistantly, but to place more in the 
regions of a large third derivative. 

We want to achieve a smooth distribution function g{s\x). This is achieved by the follow- 
ing Tikhonov type regularisation procedure: The total curvature c = jQ{g{s\x)"yds should 
be as small as possible with G{s\x) still giving a good description of the output levels. 
Constructing G{s\x) using cubic B-splines, c is just the bin-size weighted sum of the third 
derivatives' squares of the spline. These are constants between any two knots of the spline, 
and can be easily added to the normal equations. 

Thus the regularised spline fits stay linear fits which only demand an inversion of a sym- 
metric N X N band matrix of band width 4, which is performed very quickly [2J. In addition, 
in order that F(t\x) can be interpreted as cumulated probability distribution, the constraints 
of monotoniticity, positivity and a maximum value of 1 must be satisfied. In order to keep 
the fits linear and fast, these unequality constraints are fulfilled by the following procedure: 
First a normal fit is performed , this can be done analytically in one step. If the result does 
not fulfill all requirements, a quadratic loss function multiplied by a large number is added to 
the normal equations of the corresponding parameter or difference of parameters. This way a 
new fit is expected to result at exactly the border of the allowed region. Then the procedure 
is iterated, until all constraints are fulfilled. Accepting numerical uncertainties of 10^^, one 
or two iterations are usually enough. This procedure works well and is considerably easier to 
program than iterative usage of Lagrange multipliers to fulfill the unequality constraints. 

The probability density g{s\x) can be calculated analytically from the spline parametrisa- 
tion. To transform back to the original variable t the inverse mapping F^^ has to be applied. 
If F is stored as 101 interval borders as described above, t can be retrieved from searching in 
the list and linear interpolation. The function f{t\x) = f{t) ■ g(F{t),x) also is calculated as 
a 100 value table with linear interpolation. 

2.6 Important Quantities: 

The median of the conditional distribution f{t\x) is calculated from 



Left and right error intervals may be defined as in Gaussian distributions as the limits 
of the interval that contains 68.26% of the data: 



^^ = ^"'(^-^(0.5)). 



(9) 



cxfe/i = F-\G-\0.5)) - F-\G-\0M13)) 



(10) 
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and 

"^right 

The expectation or mean value (t) can be estimated from 



(T„ght = F-\G-\0.1587)) - F-\G-\0.5)) (11) 



(t) = J t'f{t'\x)dt' (12) 
= J t'{s')g{s'\x)ds' (13) 
^ J2F-\sMsjW) (14) 



M 



l/Mj2F-\iG-\{m-0.5)/M)) (15) 



m=l 



This latter expression is particularly simple to calculate: One just chooses M equidistant points 
ym between and 1, and performs the two back transformations F^^(G^^(y)) successively on 
them to achieve the corresponding values, and takes their average. This simple operation 
even is an optimal estimator of the integral at a given number of function evaluations: It 
contains an implicit importance sampling. 

The expectation value for any function a(t) of t is estimated from 



(a(t)) = J a{t')f{t'\x)dt' (16) 
^ Y.a{F-\s^))g{s,\x) (17) 



M 



l/M J2 a{F-\{G-\{m - 0.5)/M))) (18) 



m=l 



Again the last expression delivers a very simple and effective way of calculation. An example 
application is option price calculation, leading to a better prediction than the Nobel-prized 
Black-Scholes formula |17j. 



3 Theoretical framework 

Here we shortly summarise the mathematical framework from statistics and learning theory of 
the heuristic method explained in the previous section. 

3.1 The learning problem 

A probability density f{t) is defined from the following Fredholm integral equation of first 
kind: 

f f(t')dt' = F(t). (19) 

The probability distribution function F(t) for a scalar random variable is defined as the prob- 
ability that a random realisation has a value smaller than t: 

F{t) = P{e < t}. (20) 
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For a vector quantity, F{x) is defined coordinatewise: 



f{x')dx'= ... f{x[,...,x'Jdx[...dxl (21) 

-oo J —oo J —oo 

A conditional probability density f{t\x) is defined by 

/* r f{t'\x')dF{x')dt' = F{t,x). (22) 

J —oo J —oo 

Here F{x) is the distribution of random vectors x, and F{t, x) is the joint distribution function 
of pairs (t, x). Since dF{x) = f{x)dx and f{t\x) = f{t,x)/f{x), f{t\x) is a conditional 
probability density. Equation |221 is an example of a linear operator equation Af = F, where 
both the right side F and the operator A (depending on F{x) ^ Fi{x)) are only approximately 
known. 

Our problem is to estimate a non-parametric solution f{t\x) of integral equation where 
the probability distributions F(x) and F(t, x) are unknown but defined from a finite number of 
pairs (ti, xi), (ti, xi). This means that F(x) and F(t, x) are approximated by the empirical 
distribution functions 

1 ' 

^; = tE0(^-^*)-0(^-^"'O (23) 
' i=i 

where 

denotes the Heaviside step function. The Glivenko-Cantelli theorem [14j assures that this 
empirical distribution function Fi is a good estimate of the actual distribution function F, 
and uniform convergence takes place as / ^ oo. Kolmogorov and Smirnov have shown that 
the convergence is proportional to \/l [IS]. For one-dimensional probability distributions, the 
deviation is limited by 

lim P{Vl snp \F{t) - Fi{t)\ > e} = 6"^"'. (25) 

l^oo t 

For the multidimensional case the limit is unknown but known to exist [H]. 
3.2 Solving the ill-posed problem 

To solve the integral equation the quadratic risk function W = \\Af — F|p is minimised. 
Since we approximate the real risk by the risk calculated from the given events, one talks 
about empirical risk minimisation (ERM). 

Solving such a Fredholm integral equation of first kind is known to be an ill-posed problem. 
This means that it cannot be assured that the solution is stable. An infinitesimal change on 
the right hand side may lead to a large variation in the solution f{t\x). Oscillatory behaviour is 
a typical phenomenon of solutions of ill-posed problems. Several regularisation schemes have 
been proposed to stabilise the solutions, putting in some theoretical prejudice. One of the 
most common techniques is the Tikhonov regularisation scheme [llj. Here the risk function 
to be minimised is modified to Wt = \\Af — F]]"^ + ^n(f) by adding a regularisation term. 
A possible ^{f) could be the total curvature of the solution / if one knows it to be a smooth 



9 



function. The regularisation parameter 7 > has to be chosen such that the original risk W 
is still small and simultaneously good smoothness is obtained. 

Another important ansatz is structural risk minimisation (SRM) [8J. Even in so-called 
non-parametric estimations the solution is constructed from a number of basis functions, e.g. 
polynomials, splines, or some orthogonal function set. Learning then means to determine the 
free parameters that e.g. describe the amount of each basis function. 

3.3 Generalisation error and VC dimension 

Learning theory has shown that generalisation ability is best when only few parameters are 
necessary. However, not only the number of parameters is decisive, but also the structure 
of the functions. So high frequency oscillatory functions may contain only few parameters 
but can lead to very unstable solutions. Vapnik and Chernovenkis [16j have introduced the 
concept of the VC dimension. A small VC dimension means good generalisation ability of a 
learning machine, whereas infinite VC dimension denotes just a learning by heart of the given 
events and no generalisation ability at all. For affine linear functions in n-dimensional space, 
the VC dimension simply \s n + 1. This is not true in general, for non-linear functions it may 
be much larger. 

For neural network training, a structure may be defined by the network architecture. The 
more hidden nodes are introduced, the smaller the empirical risk, but the larger the structural 
risk. Overfitting may occur and generalisation ability degrades when too many nodes are used. 
This requires careful control. 

The VC dimension of neural networks can decrease significantly when the weights are 
bounded to be small jTH]- However, independent of the size of the weights, a lower limit on 
the VC dimension of a three-layer neural network with n input nodes and k hidden nodes has 
been established [19j: 

VCdim> (k-l) ■ (n + l), (26) 

equal to the number of weights of a network with n — 1 input nodes and k — 1 hidden nodes. 
VC dimension is a worst case concept, and there is substantial evidence that improved gener- 
alisation performance is achieved when networks are trained with weight decay regularisation 
j20| . This can be understood in a Bayesian framework which also allows to choose 
regularisation parameters Oc and P- 

M = J2a,E'^{w\A) + (3Ed{D\w,A) (27) 

c 

where 

EniD\w,A)= E E ho^in-trf (28) 

events m output nodes i 

is the quadratic error function of the data and 

E'wHA)= E (29) 

i weights 

is the regularisation energy (which leads to a subtraction of a multiple of w in the weight 
update rule). It has been shown[21J to be useful to separate this term into three (c = 1,2,3) 
terms, which determine the weight decay of the weights of the input measurements to the 
second layer, the bias node to the second layer, and the second to the output layer, since there 
is no a priori reason why these weights should be in the same order of magnitude. 
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3.4 Bayesian approach to regularisation 

The Bayesian approach in j2I] leads to the following optimal parameters: The point of maximal 
probability P{D\a,P) to observe the observed data D has the following property: 

xl, = 2aEw = 7 (30) 
xl = 2PEn = iV-7 (31) 

where 7 is the effective number of degrees of freedom: 

a=l ^a + a 

where Aa are the eigenvalues of the quadratic form PEn in the natural basis of Ew- If the 
entropy function eq.|H]is used instead of the quadratic loss function ()28jl . there is no need for 
a /3 to be estimated from the data 

Calculating the Hessian matrix H = VVM allows to distinguish between well and poorly 
determined parameters, and to calculate error bars on weights and net outputs. This is not 
yet implemented in NeuroBayes. 



3.5 Pruning 

When during the training weights become completely insignificant (less than O.OOlcr), the 
connections are pruned away, i.e. set to exactly zero. Thus, the architecture is changed and 
the number of free parameters is lowered. The VC-dimension explicitely is reduced by this pro- 
cedure. It is interesting to note that something similar works in biology: neurobiologists have 
found that the intelligent mature brain has less connections between the neurons. In contrast, 
an untrained (young) neural network still has many connections. During its development the 
brain actually loses the neuronal connections that are less used, and forms strong connections 
in those synaptic circuits that have been utilized the most. Pruning improves the signal-to- 
noise ratio in the brain by removing the cause of the noise. This process is constant and 
quick. Synaptic connections can form in a matter of hours or days. Experience, particularly 
in childhood up to early adulthood, sculpts the brain [23J. 



3.6 Automatic Relevance Determination 

MacKay [21j has introduced an interesting extension of the weight decay procedure by intro- 
ducing a hyperparameter for each individual input node. This is called "automatic relevance 
determination" and allows the algorithm to better prune away unsignificant input variables. 
We have found this extension to be useful. 



3.7 Automatic Shape Smoothing 

A direct copy of the same concept, for the connections to the different output nodes, called 
Automatic Shape Smoothing, also is implemented in NeuroBayes. 



4 Some considerations for efficient net training 

Here we list a few experiences we have gained during development of this code. 
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4.1 Online instead of batch training mode 

The training is performed in stochastic, mini-batch or quasi-online mode: A weight update is 
performed after about 200 events. Since the problems to be solved by NeuroBayes typically 
learn from relatively large and partly redundant training sets, online learning is much faster 
than batch mode, where weight updates are only performed after all training examples have 
been read 

4.2 Random start values for weights 

We preset all weights with random numbers distributed around zero. To ensure a fast initial 
learning, it is useful to take Gaussian random numbers with a a = 1 / i/rim , where nj„ is 
the fan-in of a neuron, i.e. the number of incoming weights. If the input variables are also 
distributed like a standard Gaussian (see below), then this is true also for the output of the 
hidden layer nodes, such that the same argument holds for the output layer. This ensures 
optimal learning conditions directly from the start. 

4.3 Non-randomness of start values in the second and output layer 

However, since we expect in each single event that the net output oj is a monotonous function 
of j, we start off with the same random weight wfp^ for all j for a given /. During the training 
they will quickly change, but due to the monotonous targets of all training events this feature 
will survive at each training step and thus reduce statistical uncertainties in the reconstruction 
of the probability density function. 

4.4 Relative importance of the N output nodes 

What are the maximum errors that the single output nodes can contribute? Using the quadratic 
loss function, this can be estimated from 

X? ^ N £ gio,)iP, ■ (o, - 1)2 + (1 - P,) • (o, + m (33) 

where g{oj) is the distribution function of output node j and the purity Pj is the fraction of 
events with target -|-1. In an untrained net with random weights, when the outputs still are 
completely random, g{oj) = 1/2, the mean is x^/^ = 4/3 independent of Pj. 

The next extreme is that the network learns the target = +1 to target = —1 ratio for 
each output node, but nothing else. This corresponds to g{oj) = 5{oj — 2Pj + 1) and a mean 
X^/N = APj{l~Pj), which is 1 for Pj = 50%, and down to 0.0975 at Pj = 2.5% and 97.5%. 

This is a trivial learning, i.e. the net learns the inclusive distribution, but no improvement 
from the individual input vectors. It thus is useful to know about this (constant) contribution 
from Xj,indusive- To follow and judge the quality of the training we compare each node's Xj 
to the X^j inclusive- R^tios below 1 indicate real learning beyond the shape of the inclusive 
distribution. Sometimes it has been useful to give a larger weight to the extremal nodes after 
some iterations during the backpropagation step, such that small systematic effects here have 
a chance to be learned compared to the large statistical fluctuations of the inner nodes. 

This is not necessary with the entropy error function. In this case, learning just the inclusive 
distribution corresponds to Ed/N = Pj log (Pj) + (1 - Pj) log (1 - Pj). 
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4.5 Learning with weighted events: 



It is possible to train the network with weighted events. This is useful if e.g. one set of Monte 
Carlo simulation events is available but single input distributions (e.g. lifetimes or branching 
ratios) must be adjusted to new (better) values. For time series prediction one may want 
to give more recent data more weight than ancient data, without completely neglecting the 
latter. 

A too large variation of the weights however drastically reduces the statistical accuracy of 
the learning, the "effective" number of training events decreases, and statistical fluctuations 
of the large weight events may start to fool the network. 

The backpropagation algorithm is adjusted by multiplying the difference of output and 
target by the weight. The mean weight should be one. 

4.6 Bias nodes 

The introduction of bias nodes, i.e. nodes which have a constant value 1, are generally 
extremely useful for net learning. A bias node in the input layer is clearly advantageous. 
However, in this multi-output net the existence of a bias node in the second layer has proven 
not to be useful and may lead to convergence into a non-optimal local minimum. The reason 
is that at the beginning of the training the error is dominated by the vastly different number 
of signal and background events for the non-central output nodes. For example, just 2.5% of 
the data has truth values below the first threshold for N — 20 levels. The fastest way to learn 
this fact is to shift the threshold of the final sigmoid transfer function to —1.9, which is easily 
achieved when a bias node is available in the second layer. Only later the other variables are 
looked at, however now at the expense that often the relevant output node already is saturated. 
Thus, learning is more difficult. In practice, leaving out a bias node in the intermediate layers 
always has led to smaller values. 

4.7 Shifts in individual output transfer functions 

We noticed that at least in difficult (low correlation) cases it may be of advantage to shift the 
weighted sum of the output units such that, if the weighted sum is zero, the inclusive distri- 
bution results automatically. This is achieved by the following replacement of the argument 
of the transfer function of the output node j: 

S{J2wijai) ^ S{J2wijai + S~\l - 2 * Pj)) (34) 

where Pj denotes the nominal probability level of the inclusive distribution. 

4.8 Fast function evaluation 

To speed up the learning process it is important that function evaluation is as fast as possible 
in the inner learning loop. It can for example be accelerated by tabulating the sigmoid function 
and using linear interpolation. 

4.9 Dynamic step size adaptation 

We use the following procedures to speed up learning: From the first few thousand events 
the largest eigenvalue of the Hessian (i.e. second derivative) matrix of with respect to the 
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Figure 4: Left: True distribution f{t) (black) along with distribution of two measured 
quantities (green, red). Right: Correlation of one of the measured quantities with the 
truth. 

weights is calculated according to the recipe of [211 I2ZI. exploiting the power method and 
Rayleigh quotients ||28j. This defines the largest learning rate. 

4.10 Randomising event order 

During learning the network usually sees the events always in the same order. Since the weights 
are updated every 200 or so events, the state of the network at the end of a training period 
has a slight bias towards the last events. We have found it useful to randomise the order of the 
events in each iteration. This avoids the bias to always be the same. Checking the network 
performance on a test sample after each iteration one can observe and estimate the size of the 
statistical fluctuations due to this effect. We have found that - especially in difficult learning 
patterns - the randomisation of the order improves convergence. The chance that the network 
always moves along the same path and gets stuck in a local minimum is minimised. 

4.11 Choosing input variables 

There are two important classes of input variables that should be considered: those which are 
correlated with the true output give information about the deviation of the mean and median 
from the unconditional distribution, and quality variables which determine the width of the 
conditional probability distribution, e.g. estimated measurement errors. Both types should be 
fed into the net. 



14 



4.12 Preprocessing of input variables 



A completely automatic and robust preprocessing procedure has been developed. First all input 
variable distributions are equalised just like the output variable described above (by sorting). 
This is especially important because otherwise (probably wrong or unreliable) extreme outliers 
in one variable can completely saturate neurons and thus dominate the net output. The 
equalised variables are scaled to lie between -1 and 1. Using the inverse of the integrated 
function, the flat distributions are then converted into Gaussian distributions, centered at zero 
with standard deviation 1. This is important for two reasons: a nonzero mean value implies a 
large eigenvalue of the Hessian in weight space that restricts the initial allowed learning rate 
[24j . And a width of one, together with the random weights preset as described above, makes 
sure that also the inputs to the output layer are distributed with mean zero and width one, 
thus providing optimal conditions for a fast initial learning and avoiding neuron saturation. 

4.13 Decorrelation of input variables 

Now all continuous input variables have a Gaussian shape with mean zero and width one. 
However, they still may be correlated. For network training it is advantageous to decorrelate 
the input variables. To do this, first the covariance matrix of the preprocessed input variables 
is calculated. Then it is diagonalised using iterative Jacobian rotations [2J. The rotated input 
vectors are divided by the square root of the corresponding eigenvalue. This way the covariance 
matrix of the transformed variables is a unit matrix. 

4.14 Linear optimisation towards one input node 

Once the inputs are prepared this way, the covariance "ellipsoid" is a sphere, and any rotation 
applied to the input vectors leaves them orthogonal and normalised. Now the correlation 
coefficent of each of the eigenvectors to the true target value is calculated. Using a sequence 
of iV — 1 rotations, it is possible to rotate this unit matrix such that only one variable shows 
the complete linear correlation to the target, and all other correlations are zero. The resulting 
correlation is the best one can achieve with a linear method. It also has been proven useful to 
rotate the complete correlation to the second moment of the target distribution to the second 
input variable etc. Exploiting the non-linear neural network, more is possible. 

4.15 An alternative mapping of the input variables 

Another possibility to increase net learning speed is to perform another variable transformation: 
The mean performance is plotted against the equalised input variables. If one observes a 
monotonuous but non-linear correlation, it is useful to fit this dependence and take the fitted 
function value as input instead of the original variable. Although this non-linear function 
should also be learned by the net, this preprocessing may help in finding the global minimum 
or at least a good local minimum. 

4.16 Why learn the cumulative distribution? 

This algorithm learns the cumulative distibution in each output node, not the interesting 
distribution itself. One may ask why this is necessary: one could instead directly train the 
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network outputs to the true t value lying in a given interval, since neural networks are able 
to map every function. This is correct. However, experience shows that is more difficult for 
a network to learn it, especially when the correlation is weak. Some more subtleties occur: 
The mean output sum of such a (multinomial) multi-output network is 1, but event-by-event 
fluctuations lead to deviations. Smoothness is thus more difficult to achieve. Also, knowledge 
of the cumulative distribtion readily allows for calculating expectation values in a statistically 
optimal way. 

5 Monte Carlo tests 

5.1 A simple toy example: two measurements of the same variable 

A Monte Carlo experiment shows how the method works and how it takes into account the 
a priori knowledge of the inclusive distribution. 50000 events distributed according to f(t) = 
5/8 ■ ■ (2 — t) have been simulated. For each event two independent "measurements" 
with Gaussian smearing are available: One with a constant a = 0.4, and another one with 
a = ^0.25^ + (0.35 ■ (2. - t))^. Fig. El shows f{t) along with the distributions of the two 
measurements. Observe that a part of the measurements yield values outside of the physical 
region. A simple weighted mean cannot avoid this, but the Bayesian network will avoid it. 
The right side shows the correlation of the more precise measurement with the truth. Four 
input variables are chosen, the two measurements and estimations (with 15% smearing) of 
their uncertainties. Of course, for the second measurement the error calculation cannot use 
the true t, but has to estimate it from the very uncertain measurement. The equalisation of 
the target variable is shown in Fig. 1. The output of three example events are shown in Fig. |5l 
Two of the examples can well be reconstructed from the measurements (steep curves top left), 
whereas the other only had weak additional information from the measurements, such that 
the result differs only slightly from the inclusive distribution. Top right of the same figure 
shows the conditional probability densities g{s,x) of the transformed variable. A constant 
distribution here means that there is no more information than the inclusive distribution. One 
of the example events exhibits such a behaviour, only very low and higher values are suppressed. 
On the bottom of Fig.Elthe final output f{t\x) is shown for these example events, along with 
their mean values, error estimates, median and maximum likelihood estimate. Note that not 
only an expected value, but also an error estimate is given event by event. Deviations from 
Gaussian behaviour is clearly visible, partly due to the limits of the physical region. 

Fig.l^shows the resolution (distribution of estimated minus true values) of the maximum 
likelihood estimator derived from f{t\x), as compared to the weighted average of the two 
measurements that enter the neural net. A marked improvement is visible. Also the estimated 
uncertainty is important and good information: Note how the resolution improves with a 
smaller error estimate (here taken as sum of left and right standard deviations). 
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Figure 5: Top left: Network output o vs. purity P{s) = s of output node for three 
events. The boxes denote the 20 network outputs, the hues are the sphne fits G{s\x). 
By construction, the unconditional G{s) is the diagonal from (0, 1) to (1, —1). Top right: 
the conditional probability density g{s\x) for the three events, determined from the first 
derivative of the spline fit. Bottom: The conditional probability density f{t\x) for the 
three example events. In all cases also the true values, mean, median and left and right 
error are indicated. The black histogram is the inclusive distribution f{t). 

5.2 More sophisticated: A function of two different measured variables with 
error propagation: lifetime reconstruction 

The proper lifetime t of a relativistic particle can be reconstructed from the decay length d 
and the momentum p via 

md , ^ 

t = 35 

c p 
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Figure 6: Left: Resolution of the maximum likelihood estimate of t (black), compared 
to the resolution of the weighted mean of both input measurements (red curve). Right: 
Error estimate {crieft + (bright) vs. deviation of the maximum likelihood estimator from the 
true t values. 



where m and c are constant and known quantities, the particle mass and the speed of light. 
The inclusive distribution is an exponential decay law: f(t) = , where r is the mean 
lifetime. The numbers taken for this example are typical for the measurement of 6-hadron 
lifetimes using inclusive reconstruction in DELPHI jTj. The usual way is to measure d and p 
separately, and then to divide them. However, both measurements can have relatively large 
errors and unphysical values, in which case this is not the optimal thing to do. 

Here we investigate whether the net also can learn that it has to divide two input num- 
bers and how to combine the measurement errors. Since three layer neural networks can 
approximate any function this should also work in this configuration. 

A toy Monte Carlo has shown that this is feasable. The momentum distribution is generated 
as in the toy example above, scaled to between and AhGeV, and decay length from an 
exponential distribution in lifetime t with r = l.'ops and calculation of d. Decay length 
resolution is simulated as Gaussian smearing of width 250 /im, and momentum resolution as 

d 

Gaussian smearing of width (10 — p/5) GeV . The acceptance is modelled as A{d) = 1 — e ''o , 
in addition events with negative measured decay lengths are rejected. 

A neural network is then trained with just two input nodes, the measured d and measured 
p, and the true proper time t to define the targets. The results are very promising: Some 
example events are shown in Fig. |71 Obviously the net can learn the division and the error 
propagation. The right plot shows the deviation of the NeuroBayes maximum likelihood and 
median estimators along with that of the direct ratio of measured d and p (shaded area). They 
have about the same resolution, but do not show the large upwards tail. 

This demonstrates that at least simple functional combinations of different measured quan- 
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Figure 7: NeuroBayes lifetime reconstruction from measured decay length and momentum. 
Left: f{t\x) for some events. The widths show the generally expected tendency of getting 
broader with increasing t. Different widths at the same mean t are learned from the 
different combinations of d and p leading to the same t. 

Right: Deviation of NeuroBayes median (black) and max likelihood (red) estimators from 
the true t values, compared to the direct ratio of measured variables (shaded). The 
resolution is similar, but the right tail vanishes due to the Bayesian ansatz. 



titles can directly be learned by such an approach. This can be extremely helpful since also 
the (non-Gaussian) error propagation is handled correctly. 



6 Real application examples 

6.1 A difficult case: Improved B energy reconstruction in DELPHI 

b-hadrons may decay into thousands of different channels. BSAURUS [7] contains algorithms 
that try to estimate the energy of a b-hadron inclusively, without knowing which the actual 
decay channel was. The currently best estimate is built from track rapidity exploits the 
track neural network, the missing energy in the hemisphere and the invariant mass of the 
particles which probably stem from a B decay. A NeuroBayes network has been set up and 
trained to try to improve its performance. Twenty input variables were chosen, which include 
different estimators of the energy available in BSAURUS, their input quantities as well as 
some measures of the measurement uncertainty like hemisphere multiplicity, reconstruction 
quality flag, B-fragmentation distinction quality, number of identified leptons and number of 
reconstructed jets. 

Compared to the best classical BSAURUS estimator a non-negligiable improvement could 
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Figure 8: NeuroBayes inclusive B energy reconstruction for the DELPHI experiment. 
Top left: resolution of the currently best BSAURUS energy estimator (shaded) with 
NeuroBayes median and max likelihood estimator. Top right: resolution of NeuroBayes 
median estimator for 4 intervalls of its error estimate: a < 2.25 GeV, 2.25 < a < 3GeV, 
3 < 0" < 4 GeV and cr > 4 GeV (shaded). Bottom left: Pull distribution of the NeuroBayes 
median estimator. o"l has been used when the estimated value was below the true value 
and vice versa. It appoximates well the standard Gaussian of width 1. Bottom right: 
Distribution of the maximal bins in g{s\x) for many events (dark histogram) and the 
mean g{s\x) of all events (shaded area- should be flat). 



be achieved (top left in Fig.|Hl). Moreover, for the first time good error estimates are available: 
This can be seen in the top right plot which shows the resolution in four bins of the estimated 
uncertainty. The corresponding resolutions clearly are very different, it thus is possible to 
select events with especially good reconstructed energy. This is very helpful, for example in 
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spectroscopy analyses. The lower left plot shows the pull distribution, i.e. the deviation of the 
median energy estimate from the true value, divided by its estimated error, ai or ar, depending 
on the sign of the deviation. The mean width is compatible with 1 on both sides. However, 
the right side shows a stronger, non-Gaussian tail which has its origin in low energetic events. 
The single f{t\x) distributions clearly exhibit this non-Gaussian behaviour. The lower right 
plot shows the distribution of the maxima of g{s\x), which should and in the best of all worlds 
would be flat. It is however clearly visible that the method often pushes the maxima into 
the extreme values and 1, when the truth is near those values. The shaded area shows the 
average g{s\x), obtained by adding many events. This also should be flat, and it is already 
much flatter than the maximum positions. 



6.2 A very difficult case: Econometry 

This algorithm is also very useful in time series predictions |2T]|22]- An important application 
of time series prediction is financial forecasting, and maybe here the algorithm brings the largest 
benefits since the correlations to the target are very small und hidden under large stochastic 
background. Historical data can be used to train the network. 

To predict the future of an equity price is of course a very difficult problem, since it is very 
uncertain and clearly depends on many unforeseeable facts. The most robust estimate is that 
the mean value will stay constant. On the other hand, it is known that over many years there 
is a rise observable. A first look to the data suggests that the "random walk model" works 
very well. This model suggests that every day's price change is truely random with mean value 
zero, e.g. according to a Gaussian distribution 

f(t,x) = f{t) = -^e-^ (36) 

and there is no more information in the price history from which x can be constructed. This 
states that it is impossible to predict the direction, but the standard deviation a (the "volatil- 
ity") can be estimated from history. Lots of physicists and mathematicians are now hired by 
banks to do quantitative risk calculations and develop strategies [29j. 

Nevertheless some features which probably lie in human nature (greed and fear) make 
it possible to look a bit into the future. This is the basis of what is known as "technical 
analysis". There is an enormous literature on this subject, see e.g. [30j. Many of the common 
"technical" concepts seem a bit esoteric and subjective to a physicist and there is very few 
quantitative information about the reliability published. 

Without going into details - just saying that a clever choice of input variables with intelli- 
gent preprocessing is an important prerequisite - it is possible to observe short time correlations 
from the past to the future of equity prices. Of course it must be made sure that enough 
input data is taken, such that the net does not only learn statistical fluctuations "by heart". 

Training on 20 years' data of the 30 equities of the Dow Jones Industrial Index with a 
relatively simple input model leads to the performance shown in Fig. El A clear dependence of 
the mean true performance as function of the mean of the estimated performance is visible, 
which still is small, but not at all negligible compared to the spread. The left side of that figure 
shows the resulting conditional probability densities for three examples. In addition to finding 
optimal combinations of different technical indicators on the basis of history, NeuroBayes does 
not only give an indication of "up" or "down" but quantifies it day by day and equity by equity. 
This could help in decision finding and especially timing of financial transactions. 



21 



t 



< t > estimate 



Figure 9: Left: Three examples of estimated conditional probabilities for a measure for 
10 working day price changes of American Dow Jones equities. Right: Correlation of 
the true 10 days' performance to the estimated mean performance. A clear correlation is 
visible. 

The method may also be used for determining a fair price for a call option when t is the price 
of the underlying at the date of maturity, the distribution of which is estimated by f{t), and 
S is the strike price: Neglecting interest rates (can be included), a(t) = min{t — 5,0.). The 
famous Black-Scholes formula (T7j is just a special case of this pricing model for the random 
walk model shown above, i.e. a simple Gaussian with zero mean (t) = and "volatility" a. 

7 Developments from 2001 to 2003 

The NeuroBayes algorithm has been completely receded in a structured way by the Phi-T 
project, J. Bossert, A. Heiss and the author, between 2001 and 2002. Phi-T was sponsered by 
the exist-seed program of the German Bundesminsterium fur Bildung und Forschung BMBF 
with the aim to found a company which makes the technology available outside physics. In 
October 2002, the Phi-T Physics Information Technologies GmbH was founded in Karlsruhe. 
Phi-T has developed lots of additions and improvements to the algorithm and the code, 
bindings for several programming languages and several input/output interfaces. They use 
NeuroBayes with large success for insurance and banking applications [33j. 

Also in physics research NeuroBayes and its predecessors have found a lot of very successful 
applications: 

• Measurement of the b-fragmentation function with DELPHI PH 

• Precision lifetime measurements of and mesons ITT] 
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• Spectroscopy of orbitally excited B mesons, in particular discovery of mesons jUJEH] 

• Search for resonances in the final state Bn^n^ with DELPHI [39J, 

• Electron identification in the CDF experiment at Fermilab PU] 

8 Summary 

A simple but effective and robust algorithm has been described that calculates the conditional 
probability density f{t\x) from (simulated or historical) example events for multidimensional 
input data. It is using a smoothing function over a number of neural network outputs, each 
of them delivering a Bayesian a posteriori probability that the real value lies above the cor- 
responding threshold. Completely automatic and parameter-free input variable preprocessing 
and stochastic second order methods to adapt individual learning rates for each weight speed 
up the quasi-online learning by at least an order of magnitude compared to plain-vanilla back- 
propagation. The generalisation ability is controlled by a weight decay regularisation with 
several independent regularisation constants that all are chosen and recalculated online during 
the training using a Bayesian reasoning. 

Several toy examples and some real examples have been tested. The procedure is an event- 
by-event unfolding with multidimensional input. It can directly be used in maximum likelihood 
fits. It allows to handle difficult measurements in a large number of application areas in an 
optimal way. 
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